fix(fiber): split DA Submit at Fibre's 128 MiB upload cap + duration log #3307
Conversation
The Fibre Submit path was opaque: failures showed up as DeadlineExceeded with no signal of how long the upload actually took, and successes were only logged at debug level inside the upstream library. During load-test debugging this turned into a guessing game — was the cluster slow, the deadline too tight, or something stuck mid-RPC? Add a single info-level (warn-on-failure) log line in fiberDAClient.Submit covering the Upload call: duration, flat blob bytes, blob count. The cost is one time.Since call, and it gives the operator concrete numbers — e.g. "17 blobs / 115 MiB / 1.5 s" — to reason about whether RPCTimeout, pending cap, or batch sizing is the right knob to turn next.
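A minimal sketch of what that timing wrapper could look like, assuming a `log/slog` logger and treating the upload as an injected function; the actual logger, field names, and Upload signature in `fiberDAClient` may differ:

```go
import (
	"context"
	"log/slog"
	"time"
)

// logTimedUpload is a hypothetical helper: it times a single upload and emits
// one info-level (warn-on-failure) record with duration, flat payload bytes,
// and blob count, as described above.
func logTimedUpload(
	ctx context.Context,
	logger *slog.Logger,
	upload func(context.Context, []byte) ([][]byte, error), // stand-in for the fiber Upload call
	flat []byte,
	blobCount int,
) (ids [][]byte, err error) {
	start := time.Now()
	ids, err = upload(ctx, flat)
	elapsed := time.Since(start)

	args := []any{"duration", elapsed, "bytes", len(flat), "blobs", blobCount}
	if err != nil {
		logger.Warn("fibre upload failed", append(args, "err", err)...)
		return nil, err
	}
	logger.Info("fibre upload succeeded", args...)
	return ids, nil
}
```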
Under sustained txsim load (~50 MiB/s) the DA submitter batched 10 block_data items into one Upload(), producing a flat payload of ~138 MiB. Fibre's per-upload cap is a hard ~128 MiB ("blob size exceeds maximum allowed size: data size 144366912 exceeds maximum 134217723"), so it rejected every batched upload. With MaxPendingHeadersAndData=10 that meant 170 consecutive failed submissions before the node halted itself with "Data exceeds DA blob size limit".
Wrap the Upload call in a chunker that groups input blobs into ≤120 MiB chunks (8 MiB of headroom under Fibre's cap for the per-blob length-prefix overhead added by flattenBlobs) and uploads each chunk separately. The chunker aggregates submitted counts and BlobIDs across chunks; on the first chunk failure it returns the error together with the partially-submitted count, so the submitter's retry/backoff logic sees a coherent state instead of all-or-nothing.
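A sketch of the grouping logic under stated assumptions: an 8-byte per-blob length prefix stands in for whatever flattenBlobs actually adds, and while the names mirror the PR (`chunkBlobsForFibre`, the 120 MiB budget constant), the body is illustrative rather than the exact implementation:

```go
// fibreUploadChunkBudget keeps each flattened upload at or below 120 MiB,
// leaving headroom under Fibre's ~128 MiB reject threshold.
const fibreUploadChunkBudget = 120 * 1024 * 1024

// chunkBlobsForFibre packs blobs into chunks whose estimated flattened size
// (blob bytes plus an assumed per-blob length prefix) stays within budget.
func chunkBlobsForFibre(blobs [][]byte, budget int) [][][]byte {
	const perBlobOverhead = 8 // assumed length-prefix size per blob
	var chunks [][][]byte
	var cur [][]byte
	curSize := 0
	for _, b := range blobs {
		sz := len(b) + perBlobOverhead
		// Close the current chunk when this blob would push it over budget.
		// A single blob larger than the budget still gets a chunk of its own.
		if len(cur) > 0 && curSize+sz > budget {
			chunks = append(chunks, cur)
			cur, curSize = nil, 0
		}
		cur = append(cur, b)
		curSize += sz
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}
```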
Single oversized blobs (already validated against
DefaultMaxBlobSize earlier in Submit) still land alone and
fail server-side, but at least don't drag healthy peers
into the same rejected batch.
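A sketch of the Submit-side loop over those chunks, reusing the chunker above with hypothetical stand-ins for flattenBlobs and the fiber Upload call: each chunk is uploaded separately, counts and IDs are aggregated, and the first failure is returned alongside the partial results so the caller sees how far it got.

```go
import "context"

func submitInChunks(
	ctx context.Context,
	blobs [][]byte,
	flatten func([][]byte) []byte,                          // stand-in for flattenBlobs
	upload func(context.Context, []byte) ([][]byte, error), // stand-in for the fiber Upload call
) (submitted int, ids [][]byte, err error) {
	for _, chunk := range chunkBlobsForFibre(blobs, fibreUploadChunkBudget) {
		chunkIDs, uploadErr := upload(ctx, flatten(chunk))
		if uploadErr != nil {
			// Partial success: report how many blobs made it before the failure
			// so the submitter's retry/backoff can resume from a coherent state.
			return submitted, ids, uploadErr
		}
		submitted += len(chunk)
		ids = append(ids, chunkIDs...)
	}
	return submitted, ids, nil
}
```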
Companion to the submitter chunking fix. The submitter can
split a multi-blob batch into ≤120 MiB Fibre uploads, but
a *single* block_data item that exceeds 128 MiB still ends
up alone in its own chunk and fails server-side ("blob size
exceeds maximum allowed size"). Lower the per-block cap to
100 MiB so under high-throughput txsim a single block can't
grow past Fibre's hard limit, and update the comment to
explain the relationship between this cap and Fibre's
~128 MiB upload reject threshold.
```go
// Fibre Upload call. Fibre rejects payloads above ~128 MiB
// ("data size N exceeds maximum 134217723"); 120 MiB leaves slack for
// flattenBlobs's per-blob length prefixes and for any future overhead.
const fibreUploadChunkBudget = 120 * 1024 * 1024
```
Makes sense, https://github.com/evstack/ev-node/blob/main/block/internal/common/consts.go#L11-L12 would be 120 MiB before the prefixes.
Nice catch!
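For reference, the headroom arithmetic: Fibre's reject threshold is 134217723 bytes (5 bytes under 128 MiB = 134217728), and the 120 MiB budget is 125829120 bytes, so 134217723 − 125829120 = 8388603 bytes, just under 8 MiB, is left for the per-blob length prefixes.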
```go
// Set the per-block data cap below that so each block_data item
// fits in a single Fibre upload after the submitter splits a
// multi-blob batch into ≤120 MiB chunks.
block.SetMaxBlobSize(100 * 1024 * 1024)
```
👍🏾 can we update the ldflags here for consistency: https://github.com/celestiaorg/x402-risotto/blob/main/scripts/run-stack.sh#L61 ?
Issue
Under sustained txsim load the DA submitter batched up to 10 pending data items into a single `Upload()` call, producing a flat payload of ~138 MiB. Fibre's per-upload server-side cap is a hard ~128 MiB (`blob size exceeds maximum allowed size: data size 144366912 exceeds maximum 134217723`), so it rejected every batched upload. With `MaxPendingHeadersAndData=10` that meant 170 consecutive failed submissions before the daemon halted itself with `Data exceeds DA blob size limit`.

The Submit path also had no per-call observability — failures showed up as `DeadlineExceeded` or `oversized blob` after the fact, with no measurement of how long uploads actually took. During load-test debugging this turned into a guessing game over whether RPCTimeout, pending cap, or batch sizing was the right knob to turn next.

Solution

- `fiberDAClient.Submit`: wrap the `fiber.Upload` call in a chunker (`chunkBlobsForFibre`) that groups input blobs into ≤120 MiB chunks (8 MiB headroom under Fibre's 128 MiB cap for `flattenBlobs`'s per-blob length-prefix overhead) and uploads each chunk separately. Aggregates submitted counts and BlobIDs across chunks; on the first chunk failure, returns the error with the partially-submitted count so the submitter's retry/backoff sees a coherent state.
- Add a single info-level (warn-on-failure) log line covering the upload: duration, flat blob bytes, blob count. Cheap (one `time.Since`) and gives the operator concrete numbers — e.g. `17 blobs / 115 MiB / 1.5 s` — to reason about whether the upload pipeline or something downstream is the bottleneck.
- `evnode-fibre`: `block.SetMaxBlobSize` (120 → 100 MiB). Companion safety: after the chunker splits a multi-blob batch, a single oversized blob would still end up alone in its own chunk and fail server-side. Capping per-block data at 100 MiB ensures even a single block_data item fits in one Fibre upload.

Test plan

- No `data size N exceeds maximum 134217723` rejections under sustained load
- No `single item exceeds DA blob size limit` halts